This R notebook is a bioinformatics pipeline to analyze data obtained by sorting a barcoded transposon library in Ralstonia eutropha (a.k.a. Cupriavidus necator) by density, achieving a selection by PHB content. For background and details regarding the method, see Wetmore at al., mBio, 2015 and Price et al., Nature, 2018). This notebook has been adapted from a pipeline for analysis of chemostat competition experiments by Michael Jahn.
The dataset is a NextSeq 500 run, from two experiments with PHB sorting by centrifuge with either abundant or no nitrogen (NH4Cl). A problem (see below) was that the diversity in the libraries was much worse than we had seen before, especially in the N-starved sample. For this reason, the N-starvation experiment was redone (see another notebook) and only the N-rich condition was used.
Read in the main data tables with A) reads per barcode and sample (âpool countsâ), B) the fitness tables, C) the summary statistics, and D) the genome annotation. Tables were obtained by processing sequencing data with a custom BarSeq pipeline.
Overview about the number of reads per sample and their mapping. The proportion of mapped reads is lower than normal (~75 %) in all samples. While N+ samples and N- no 3 have ca 35 % mapped reads (already low), the remaining N- are around 9%. Notably N+ and N- no 3 was done in a different experiment compared to the remaining N- samples, with a different pre-cultures inoculated at different days. This suggest the low mapping was caused by the experiment, rather than during sequencing.
Manual inspection of the reads show that the vast majority of the reads have normal read structure, i. e. the presequence, a 20-nt barcode and the post-sequence. The problem is that the barcode could not be mapped to a known barcode from the TnSeq. Letâs inspect the actual barcodes we gotâŠ
To investigate the barcode distribution, we compare samples from the low- and high-mapping samples to a sample from the previous sucrose-formate MC experiment. Here we used unsorted F0 samples, that should not have been enriched for PHB-related mutants.
The total number of barcodes observed is similar between samples (ca 10M), but the number off mappable barcodes is much less in the new samples. In the new samples, there are a few barcodes with extremely high read counts, so that the 100 most abundant ones make upp more than 75% of the total reads. Conversely, there are few barcodes with what would normally be âhighâ counts, i. e. over 100 000.
Based on this, there could be said to be two separate problems.
Michael has inspected some of the un-mapped barcodes manually. He found both those that were not seen by the TnSeq data at all, and those that were seen but mapped to several locations (likely due to short read length). This means that the un-mapped codes are likely real strains in the Tn library, but that a small group of mutants have been overly enriched.
Which are these barcodes? Is it the same over-enriched and unmappable barcodes between the samples, and between the experiments?
Here we plot the the common genes among the 200 most seen BarSeq mutants in two experiments. We see that the in the two N-starved replicates, the reads are almost identical. In the N-rich samples, the read numbers are highly correlated but about half the mutants are about 30 % more prevalent in one of the samples. The reason for this is unknown.
Between the N-starved and N-rich samples, we also see that it is largely the same mutants that are enriched. Between the new samples and the old sample however, there were only few mutants that appeared together.
The conclusion is that the enrichment of these mutants depends on some common factor between the two different experiments performed before centrifuge sorting, but that these factors were not present in the MC experiment. The cause of these problems is likely somewhere in the experiment, not in the seq library preparation or in the sequencing. Two theories are
Distribution of number of reads per barcode. Many samples have less than few barcodes with more than 2^5 = 32 barcodes, will be hard to quantify these. Blue is the mean in each sample.
Similarly to the above, this is an overview about the number of barcodes per gene as a histogram. This distribution is the same for all conditions and replicates. The second plot is the number of reads per gene, averaged as median over all conditions (excluding 0 time point where counts were averaged by BarSeq pipeline).
The average of detected barcodes per gene is 3.8, and the median is 3. The average reads per gene is 384, and the median is 15.5.
There are somewhat more reads per gene if we include only the N-rich samples. The average of detected barcodes per gene is 3.7, and the median is 3. The average reads per gene is 499, and the median is 18.
We can plot log2 FC or normalized gene fitness over generations. For this type of overview it is best to summarize individual replicates (3x) to the mean or median, per time point and condition. We also add genome annotation to the summary table.
As sort of an internal control, we compare the gene fitness obtained by a complex procedure to the log2 FC of read counts, which is a very simple measure of âfitnessâ. The two variables correlate well, although there is often an offset between replicates.
Cells producing PHB become denser, meaning they fall through the density gradient to the middle or bottom fractions (F2 and F3). If look at which strains that are enriched here, i. e. cells producing PHB even when not N-starved, there were several significant hits. Interestingly, the two strongest hits in both conditions are the same and related: leuC and leuB. Otherwise, most of the hits were just present in one medium, and there are more significant scores (>2.5) in the formate condition.
Here are the 60 most F3-enriched genes:
If we only look at non-essential genes:
The lists of fructose- and formate-essential genes are identical in this case. The 8-generation score from the M experiments is plotted in each subpanel.
When N is abundant, wt cell produce little PHB and normaly band in the top fraction after centrifugation. What should be equivalent is to select mutants depleted in the F1 fraction. The leuBC genes are seen here as well.
Here are the 60 most F1-depleted genes:
If we only look at non-essential genes: The fructose- and formate-essential genes are identical also here. The 8-generation score from the MC is plotted in each subpanel.
To learn more about functional relationships between enriched genes/mutants, we can submit the gene list to the STRING interaction database and retrieve a network of probable interactions. Copied from Micheals BarSeq-pulse.
Get the most enriched genes and submit to STRING.
There is enrichment of some leucine synthesis genes in N-rich sinking cells. If we display all leucine synthesis genes, we see that only leuB, leuC and leuD are affected. They are depleted in floating cells and enriched in sinking cells. There is a such trend also in N-starvation conditions. None of these genes are obviously related to PHB synthesis.
## # A tibble: 15 x 4
## locus_tag gene_name Process Pathway
## <chr> <chr> <chr> <chr>
## 1 H16_A0477 leuB1 H16_A0477 Carbohydrate met⊠Glyoxylate and dicarboxylate mâŠ
## 2 H16_A0561 ilvE H16_A0561 Amino acid metab⊠Cysteine and methionine metaboâŠ
## 3 H16_A1041 leuA H16_A1041 Amino acid metab⊠Valine, leucine and isoleucineâŠ
## 4 H16_A1236 leuC leuC1 H16_A⊠Amino acid metab⊠Valine, leucine and isoleucineâŠ
## 5 H16_A1237 leuD leuD1 H16_A⊠Amino acid metab⊠Valine, leucine and isoleucineâŠ
## 6 H16_A1549 leuC2 H16_A1549 Amino acid metab⊠Valine, leucine and isoleucineâŠ
## 7 H16_A2133 leuB2 H16_A2133 Carbohydrate met⊠Glyoxylate and dicarboxylate mâŠ
## 8 H16_A2619 leuB3 leuB H16_A⊠Amino acid metab⊠Valine, leucine and isoleucineâŠ
## 9 H16_A2620 leuD leuD3 H16_A⊠Amino acid metab⊠Valine, leucine and isoleucineâŠ
## 10 H16_A2621 leuC3 leuC H16_A⊠Amino acid metab⊠Valine, leucine and isoleucineâŠ
## 11 H16_A2999 cysI1 H16_A2999 Energy metabolism Sulfur metabolism
## 12 H16_B0052 leuC4 H16_B0052 Amino acid metab⊠Valine, leucine and isoleucineâŠ
## 13 H16_B0081 leuA2 leuA H16_B⊠Amino acid metab⊠Valine, leucine and isoleucineâŠ
## 14 H16_B1581 metE H16_B1581 Amino acid metab⊠Cysteine and methionine metaboâŠ
## 15 H16_B2275 leuC5 leuC H16_B⊠Amino acid metab⊠Valine, leucine and isoleucineâŠ
The leuBCD genes share a common operon, with only these three genes. To check whether it could be a wider genomic effect, we can look at the surrounding genes. There is no data for flanking asd gene, but neither of the neighbors on either side fimV or livF show any enrichment or depletion on the gradient.
There are several hydrogenases that have a pattern where the in the fructose sample it is depleted in the floating fraction and in the formate sample is enriched in the sinking fraction. These are, to varying degrees, hoxA, hypB, hypE, hypF and the unnamed PHG390.
## # A tibble: 15 x 4
## locus_tag gene_name Process Pathway
## <chr> <chr> <chr> <chr>
## 1 PHG001 hoxK PHG001 Xenobiotics biodegradation a⊠Nitrotoluene degradatâŠ
## 2 PHG002 hoxG PHG002 Xenobiotics biodegradation a⊠Nitrotoluene degradatâŠ
## 3 PHG008 hoxQ PHG008 Poorly characterized General function predâŠ
## 4 PHG012 hypA PHG012 Genetic information processi⊠Chaperones and foldinâŠ
## 5 PHG013 hypB PHG013 Genetic information processi⊠Chaperones and foldinâŠ
## 6 PHG014 hypF1 hypF PH⊠Unclassified: genetic inform⊠Protein processing
## 7 PHG015 hypC PHG015 Unclassified: genetic inform⊠Protein processing
## 8 PHG016 hypD PHG016 Unclassified: genetic inform⊠Protein processing
## 9 PHG017 hypE PHG017 Unclassified: genetic inform⊠Protein processing
## 10 PHG018 hypX PHG018 Signaling and cellular proce⊠Two-component system
## 11 PHG019 hoxA PHG019 Signal transduction Two-component system
## 12 PHG020 hoxB PHG020 Hypothetical Hypothetical
## 13 PHG021 hoxC PHG021 Xenobiotics biodegradation a⊠Nitrotoluene degradatâŠ
## 14 PHG089 hoxU PHG089 Unclassified: metabolism Enzymes with EC numbeâŠ
## 15 PHG390 PHG390 Hypothetical Hypothetical
A collection of other, mosty little characterized genes
## # A tibble: 5 x 4
## locus_tag gene_name Process Pathway
## <chr> <chr> <chr> <chr>
## 1 H16_A3183 H16_A3183 Hypothetical Hypothetical
## 2 H16_A3741 H16_A3741 Hypothetical Hypothetical
## 3 H16_B2043 H16_B2043 Hypothetical Hypothetical
## 4 H16_B2171 H16_B2171 Lipid metabolism Sphingolipid metabolism
## 5 PHG390 PHG390 Hypothetical Hypothetical